Validation of scientific topic models using graph analysis and corpus metadata
Abstract
Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which are focused on the quality of the topics, are not a good fit for applications where the document similarity relations inferred from the topic model are especially relevant. Relying on graph analysis techniques, the aim of our work is to state a new methodology for the selection of hyperparameters which is specifically oriented to optimize the similarity metrics emanating from the topic model. In order to do this, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second metric measures the alignment between the graph derived from the LDA model and another obtained using the metadata available for the corresponding corpus. Through experiments on various corpora related to STI, it is shown that the proposed metrics provide relevant indicators to select the number of topics and build persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of this kind of techniques in STI policy analysis and design.
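The two graph metrics described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the cosine-similarity threshold, the use of Jaccard overlap between edge sets, the shared-label metadata graph, and all function names (`similarity_graph`, `edge_jaccard`, `variability`, `metadata_graph`) are assumptions made for illustration only. The input is assumed to be a document-topic matrix per LDA run, with documents in the same order in every run.

```python
import numpy as np

def similarity_graph(doc_topics, threshold=0.5):
    """Boolean adjacency matrix linking documents whose topic vectors
    have cosine similarity >= threshold (threshold is an assumption)."""
    X = np.asarray(doc_topics, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    adj = (X @ X.T) >= threshold
    np.fill_diagonal(adj, False)  # no self-loops
    return adj

def edge_jaccard(adj_a, adj_b):
    """Jaccard overlap between the edge sets of two undirected graphs
    defined over the same documents."""
    iu = np.triu_indices_from(adj_a, k=1)  # count each pair once
    a, b = adj_a[iu], adj_b[iu]
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty graphs are treated as identical
    return np.logical_and(a, b).sum() / union

def variability(run_graphs):
    """Illustrative variability score: 1 minus the mean pairwise edge
    overlap between graphs from runs with fixed hyperparameters."""
    n = len(run_graphs)
    scores = [edge_jaccard(run_graphs[i], run_graphs[j])
              for i in range(n) for j in range(i + 1, n)]
    return 1.0 - float(np.mean(scores))

def metadata_graph(labels):
    """Hypothetical metadata graph: link documents sharing a label."""
    labels = np.asarray(labels)
    adj = labels[:, None] == labels[None, :]
    np.fill_diagonal(adj, False)
    return adj

# Toy document-topic matrices standing in for two LDA runs.
run1 = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
run2 = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
labels = ["a", "a", "b", "b"]  # made-up metadata categories

g1 = similarity_graph(run1, threshold=0.9)
g2 = similarity_graph(run2, threshold=0.9)
print("variability:", variability([g1, g2]))
print("alignment of run 1:", edge_jaccard(g1, metadata_graph(labels)))
```

Under these assumptions, a low variability score and a high alignment score for a given number of topics would point to a stable model that agrees with the metadata. The paper itself may define both metrics differently.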
Similar resources
Towards Using Wikipedia as a Substitute Corpus for Topic Detection and Metadata Generation in E-Learning
Metadata is crucial for reuse of Learning Resources. Only with good metadata, there is a chance that a Learning Resource can be successfully found in a repository. However, many Learning Resources are still delivered with no or little attached metadata. Automatic metadata generation is used to put things right either as assistance for the author, or as part of a repository’s retrieval functiona...
Topic Models for Corpus-centric Knowledge Generalization
Many of the previous efforts in generalizing over knowledge extracted from text have relied on the use of manually created word sense hierarchies, such as WordNet. We present initial results on generalizing over textually derived knowledge, through the use of the LDA topic model framework, as the first step towards automatically building corpus specific ontologies. This work was funded by a 200...
Analysis of Retweeting Behavior Using Topic Models
Social networks are nowadays a constant presence in our lives and increasingly have a role in important social and commercial phenomena. Microblogging services such as Twitter appear to play an important role in the process of information dissemination on the Internet making it possible for messages to spread virally in a matter of minutes. In this research work we study the mechanism of re-bro...
Improved Analysis and Trace Validation Using Metadata Snapshots
One of the most fundamental storage system research tasks is activity tracing. By understanding the behavior of a running system we can accomplish a variety of tasks ranging from debugging and system validation, to proposing techniques to improve the performance of future systems. Yet, before we can do an effective analysis of a trace, we must first understand what activities are and are not ca...
Incorporating Metadata into Dynamic Topic Analysis
Every day, millions of blogs and micro-blogs are posted on the Internet. These posts usually come with useful metadata, such as tags, authors, locations, etc. Much of this data is highly specific or personalized. Tracking the evolution of this data helps us to discover trending topics and users’ interests, which are key factors in recommendation and advertisement placement systems. In this pape...
Journal
Journal title: Scientometrics
Year: 2022
ISSN: 1588-2861, 0138-9130
DOI: https://doi.org/10.1007/s11192-022-04318-5